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Abstract 

In this work we deal with parameter estimation in a latent varia- 
ble model, namely the multiple-hidden i.i.d. model, which is derived 
from multiple alignment algorithms. We first provide a rigorous forma- 
lism for the homology structure of k sequences related by a star-shaped 
phylogenetic tree in the context of multiple alignment based on indel 
evolution models. We discuss possible definitions of likelihoods and 
compare them to the criterion used in multiple alignment algorithms. 
Existence of two different Information divergence rates is established 
and a divergence property is shown under additional assumptions. This 
would yield consistency for the parameter in parametrization schemes 
for which the divergence property holds. We finally extend the defini- 
tion of the multiple-hidden i.i.d. model and the results obtained to the 
case in which the sequences are related by an arbitrary phylogenetic 
tree. Simulations illustrate different cases which are not covered by our 
results. 

1 Introduction 



Biological sequence alignment is one of the fundamental tasks in bioinforma- 
tics. Sequences are aligned to identify regions of similarities that can be used 
to determine structural and functional motifs in a sequence, to infer gene 
functions or to derive evolutionary relationships between sequences. Aligning 
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two sequences, which are supposed to descend from a common ancestor, con- 
sists in retrieving the places where substitutions, insertions and deletions have 
occurred during evolution. The first alignment methods, namely scored-based 
methods, used dynamic programming algorithms with fixed score parameters 
to find an optimal alignment (see Durbin et ai, 1998, for an overview). But 
since an alignment aims at reconstructing the evolution history of the se- 
quences, choosing these score parameters in the most o bjective way to have 
an evolutionary meaning seems to be an important issue. iThorne et al. fll99lh 
proposed the first rigorous model of sequence evolution including indels (in- 
sertions and deletions), referred to as the TKF91 model. Based on this model, 
they were the first to provide a maximum likelihood approach to jointly esti- 
mate the alignment of a pair of DNA sequences and the evolution parameters. 
The alignment problem in this co ntext fits into t he pa ir hidden Markov model 
(pair-HMM), as first described in lDurbin et al\ (119981 ). ensuring the existence 
of efficient algorithms based on dynamic programming methods to compute 
the likelihood of two sequences and retrieve an alignment. That is one of the 
reasons why TKF91 based alignmen t methods have be c ome popular. I ndeed , 
they have been furt her developed inlHein et al. fl200d ). [Mazier et al\ f lioOlh . 



Metzlerl (120031 ) and iMiklos et al\ ( 120041 ) among others, and this despite the 



lack of theore tical support for t he est imation procedures in this framework 
during years. lArribas-Gil et al\ ( 120061 ) were the first to study the statistical 
properties of parameter estimation procedures in pair-HMMs. 

In the last years these methods have also been extended to the case of mul- 
tiple alignment. In this context we deal with more than two sequences and we 
have to take into account the evolutionary relationships between the sequences, 
which are represented by a phylogenetic tree. Multiple alignme nt methods ap- 
plying the TKF91 model o n a t r ee are for instanc e tho se of ISteel and Hein 
( l200l[ i. lHolmes and Brun^ (l200lh . iHein et al\ (l2003h and lhunter et al\ ( l2003h . 
They generalize pair-HMMs to more complex hidden variable models and pro- 
pose maximum likelihood or Bayesian approaches for the joint estimation of 
evolution parameters and multiple alignments given a phylogenetic tree. How- 
ever, since both alignment and phylogenetic tree aims at reconstructing the 
evolutionary history of the sequences, estimating the alignment from a fixed 
phylogenetic tree may biased the result. The ideal procedure would consist in 
jointly estimating alignments and phylogenetic trees from a set of unaligned 
sequences. This p roblem has been recen t ly tackled, in t he co ntext of indel evo- 



l ution models, by lFleissner et al\ ( 120051 ) . iLunter et al\ (120051 ) and lNovak et al. 
( 120081 ). However, as it was the case during years for the pair-HMMs, no 
theoretical support is provided for the estimation procedures in any of these 
contexts. 
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This work is concerned with the study of statistical properties of parame- 
ter estimation in latent variable models derived from multiple alignment algo- 
rithms where the phylogenetic tree relating the observed sequences is supposed 
to be known. The paper is organized as follows. 

In Section 2, we motivate the problem, discuss some models of sequence 
evolution and describe the homology structure in the context of multiple align- 
ment of a set of sequences related by a star-shaped phylogenetic tree and 
evolving under the TKF91 model of sequence evolution. 

In Section 3 we present the multiple-hidden i.i.d. model on a star tree. We 
discuss possible definitions of likelihoods and compare them with the criterion 
which is actually considered in multiple alignment algorithms. We analyze the 
case in which only two sequences are considered to show that our model is 
consistent with the pair-HMM. 

In Section 4, we investigate asymptotic properties of estimators under the 
hidden i.i.d. model for the definitions of likelihoods that we have considered. 
We first prove the existence of Information divergence rates, which are the 
difference between the limiting values of the log-likelihoods at the (unknown) 
true parameter and at another parameter value. We then prove that they are 
uniquely minimized at the true value of the parameter (divergence property) 
for some parametrization schemes. Following classical arguments, this would 
yield consistency for the parameter in those cases in which the divergence 
property holds. 

In Section 5 we extend the definitions of the multiple-hidden i.i.d. model 
and the results obtained to the general case in which the sequences are related 
by an arbitrary phylogenetic tree. 

Finally, in Section 6, we illustrate via some simulations the behavior of 
the divergence rates in different cases in which the divergence property is not 
established. The paper ends with a discussion on this work. 

2 Motivation: models of sequence evolution 
and the homology structure 

In the multiple alignment problem the observations consist in k {k > 2) 
sequences Xl.^_^, X^.^^, where rii is the length of sequence i and = 
X| . . . X^., with values in a finite alphabet A (for instance A = {A, C, G, T} 
for DNA sequences) . It is assumed that the sequences are related by a phylo- 
genetic tree, that is, a tree where the nodes represent the sequences and the 
edges represent the evolutionary relationships between them. The observed se- 
quences are placed at the k leaves of the tree, whereas the inner nodes stand for 
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ancestral (non-observable) sequences. The most ancestral sequence is placed 
at the root, TZ, of the tree. The choice of the root assigns to each edge a 
direction (from the root to the leaves) and to each inner node its descendants 
nodes, but since the evolutionary process between the sequences is usually 
assumed to be time reversible, the placement of the root node is irrelevant (cf. 
Thatte, 2006). A path from the root to a leaf represents the evolution through 
time and through a series of intermediate sequences of the ancestral sequence, 
leading to the corresponding observed sequence. The evolution on each edge 
(from its parent node to its child node) is described by some evolution process. 
We assume that the same evolution process works on every edge of the tree. 
A main hypothesis is that the evolution processes working on two edges with 
the same parent node are independent, i.e. a sequence evolves independently 
to each one of its descendants. 



2.1 Models of sequence evolution 

Mutations in a sequence during the evolution process can be produced by many 
different factors. However, there are two evolutionary events that play a major 
role: substitutions of a nucleotide by a different one in a given position of a 
sequence, and insertions or deletions of single positions or sequence fragments. 

The process of substitutions has been studied in depth during years, and 
is usually taken to be a continuous time Markov chain on the state space 
of nucleotides (Felsenstein, 2004; Tavare, 1986). The process of insertions 
and deletio ns has not rec e ived t he same attention and there is more place for 



discussion. iThorne et al\ (119911 ) proposed in a pioneering paper the first indel 
evolution model, and since then many variants have been considered. The 
importance of this model is that it makes the alignment fit into the concept 
of pair-HMM, as we have already mentioned. 

In the pair-HMM for pairwise sequence alignment the indel process and 
the substitution process are combined to model the whole evolution process. 
Indeed, the hidden Markov chain corresponds to what we usually call the 
hare alignment, that is, an alignment without specification of the particular 
nucleotides at each position of the sequences. Conditionally on a realization 
of this hidden process, the observed sequences are emitted according to the 
substitution model (see Durbin et al, 1998, and Arribas-Gil et al, 2006, for 
details). 

So, in the pair-HMM the indel evolution process characterizes the hidden 
stochastic process of the alignment, whereas the substitution process corres- 
ponds to the emission functions of the observed sequences. As we will see, 
that is also the case for the multiple alignment model that we study in this 
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paper. Since the asymptotic properties of estimators in such a model are more 
related to the structure of the hidden process than to the emission functions, 
which can take a general form (see Arribas-Gil et ai, 2006), we will focus our 
attention on the indel process. 



2.1.1 The TKF91 model 

Let us briefly recall how the TKF91 model works on pairwise alignments. 
This model is formulated in terms of links and associated letters. To each link 
is associated a letter that undergoes changes, independently of other letters, 
according to a reversible substitution process. The insertion and deletion 
process is described by a birth-death process on these links. Indeed, a link 
and its associated letter is deleted at the rate /i > 0. While a link is present it 
gives rise to new links at the rate A. A new link is placed immediately to the 
right of the link from which it originated, and the associated letter is chosen 
from the stationary distribution of the substitution process. At the very left 
of the sequence is a so-called immortal link that never dies and gives rise to 
new links at the rate A. We need the death rate per link to exceed the birth 
rate per link to have a distribution of sequence lengths. Indeed, if A < /i then 
the equilibrium distribution of length sequence is geometric with parameter 

Let Pn{t) be the probability that a normal link survives and has n des- 
cendants, including itself, after a time t. Let (t) be the probability that 
a normal link dies but leaves n descendants after a time t. Finally let Pn(t) 
be the probability that an immortal link has n descendants, including itself, 
after a time t. Here H stands for homologous, for non-homologous and / 
for immortal. We have: 



where 



p^{t) = e-^*[l - A/?(t)][A/3(t)]"-i forn>l 

p^lt) = ^(3{t) for n = 

= [1 - e-^* - ^il3{t)] [1 - \(3{t)] [\(3{t)f-^ for n>l 

piit) = [i-\pmmt)r~' for n>i 



1 _ p(A-At)t 

m 



/i - Ae(^-/^)* ■ 

Conceptually, e~^* is the probability of ancestral residue survival, XP(t) is 
the probability of more insertions given one or more existent descendants and 



-l-it _ 



K{t) := ^ is the probability of insert ion given that the ancestral 



residue did not survive. See iThorne et al.\ ( 1l99ll ) for details. 
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If we want to investigate the asymptotic properties of parameter estimators 
we must consider observed sequences of growing lengths. However, this is not 
possible under the hypothesis of the TKF91 model. Indeed, the ancestral 
sequence length distribution depends on A//^, and so, for a given value of 
these parameters we can not make the ancestral sequence length to tend to 
infinity. As one would expect (and as we will show later) the lengths of the 
observed sequences are equivalent to the length of the root sequence, so under 
this setup we can not expect to observe infinitely long sequences. 

Following the ideas in Metzler (2003), we will consider the case in which 
the TKF91 model can produce long sequences, that is, the case where \ = fi. 
With this configuration, finite length sequences are to be considered as cut 
out of very much longer sequences between known homologous positions. The 
length of the ancestral sequence is now considered to be non random. 

We will note g^(t) and (t) the probability distributions of the number 
of descendants for a normal link under these assumptions. We do not need 
to consider the distribution for the immortal link anymore, since now all the 
positions on the observed sequences are descendants of normal links. 
Since \im^^xP{t) = ^i^^ we get 

1 / At 

^^it) = lim it) = e--^ j forn > 1 

9;fW = limp^(t) = for n = (2) 



n-l 



^ e-A -V ( 1 for n > 1 



+ Xt J 1 + Xt \1 + Xt 

The main drawback of the TKF91 model is that insertions and deletions 
can only be produced at one nucleotide at a time. More realisti c indel evolu- 



(1992 


), 


Miklos et al. 


(2004) or 



plicity, in this work we will just consider the TKF91 indel model. However, 
the homology structure and the multiple-hidden i.i.d. model presented here 
can be extended to the case in which other indel models are considered. 



2.2 A star tree 

Let us now consider a fc-star phylogenetic tree, that is, a tree with a root, 
k leaves and no inner nodes. See Figure [1] for an example. We will note tj, 
i = 1, . . . ,k, the branches lengths, that is the evolutionary time separating 
each sequence to the root. In this context, an alignment of the k sequences 
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7^: ACCT 




ACCGGT ACT ACT GCCAT CCT ACCT 



Pairwise alignments: 
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Figure 1: A 6-star phylogenetic tree and example of multiple alignment. TZ 
stands for the ancestral sequence and X* stand for observed sequences. Letters 
in dark represent descendants of a position in the ancestral sequence whereas 
letters in gray represent insertions. 

and the root consists in a composition of the k pairwise ahgnments of the root 
with any of the observed sequences. This is done as follows. Two characters 
and will be aligned in the same column if and only if they are homolo- 
gous to the same character of the root sequence. So there is a column for each 
nucleotide at the root containing all its homologous positions on the leaves, 
and between two columns of this kind, there is one column for each inserted 
position on the leaves between the two corresponding nucleotide positions at 
the root. Insertions to the root sequence occur independently on each sequence 
and we assume that the probability of having two insertions on different se- 
quences at the same time is 0. That is why insertion columns are composed by 
one nucleotide position in some of the sequences and gaps in all the others. We 
know that under the TKF91 indel model the pairwise alignment is a Markov 

B ~ B 

chain on the state space {b,b,_} (see Metzler et a/., 2001). Let us precise that 
from now on the word alignment will denote indistinctly the whole alignment, 
that is the reconstruction of the whole evolution process, including substitu- 
tions, of a set of sequences, or, as in this case, the hare alignment, that is, the 
reconstruction of the indel process only. We recall that when we model the 
alignment of a set of sequences as a hidden variable model, the hare alignment 
is which corresponds to the hidden process. 

In contrast to the pairwise alignment case, when we apply the TKF91 in- 
del evolution model to multiple alignment we do not get a Markov chain on 
the set of all possible multiple alignment columns. In fact, Markov models 
for multiple alignment exist but states do not exactly correspond to align- 
ment columns. Indeed, insertion states in these models describe not only an 
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insertion on one sequence but also a kind of "memory" of what is happening 
in other sequences (see Holmes and Bruno, 2001, and Hein et ai, 2003, for 
instance). This is because the Markov dependence for pairwise root-leaf align- 
ments applies independently on each sequence due to the branch independence 
of the evolution process. So that, an alignment column describing an insertion 
on sequence i depends on the last column of the alignment describing any 
evolutionary event on sequence i, but there may be several alignment columns 
describing insertions on other sequences between these two columns. See Fi- 
gure [2] for an illustration. 

So one could say that insertions to the root sequence break the Markov depen- 
dence between alignment columns. Also, the order of the insertions between 
two homologous positions is irrelevant, the only important fact being which 
positions are homologous to which (see for instance the multiple alignment in 
Figure [1] where the insertion columns are completely exchangeable). Then, 
the interesting objet is not the alignment but the homology structure, essen- 
tially an alignment of homologous positions with specification of the number 
of insertions on each sequence between any two homologous positions. The 
homology structure can be described in terms of the nucleotides at the root 
sequence. Indeed the homology structure is just the sequence of root posi- 
tions in which we specify, for each ancestral residue, its fate (whether it has 
survived or been deleted) and all the insertions occurred to its right in each 
one of the observed sequences (see Figure [3] for an example) . The homology 
structure is, as the bare alignment, a reconstruction of the indel process of a 
set of sequences. 

In the TKF91 indel model, evolution on each link is independent of evolution 
on other links (see Thorne et ai, 1991). That is why the homology structure 
under these models can be described as a sequence of i.i.d. random variables 
as we will see in the next section. 



2.3 The homology structure on a star tree 

Consider a fc-star phylogenetic tree T with branches lengths ti, . . . ,tk- The 
homology structure of the sequences related by T is a sequence of indepen- 
dent and identically distributed random variables {en}n>i- The variable 
represents the fate of the n-th ancestral sequence character (or fragment, if 
we consider fragment indel evolution models). Its distribution will depend on 
the chosen indel evolution model. Under the TKF91 indel evolution model 
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Figure 2: Alignment Markov chain for a star tree with two l eaves and sequences 
evolvi ng under the TKF91 evolution model as described by lHolmes and Bruno 
(1200 A) . The bubbles represent the states of the chain. R stands for a base on the 
ancestral sequence and B for a base on any of the observed sequences. Letters 
in brackets appear on insertion states and stand for the last event recorded on 
each sequence. The left column in insertion states is the true state whereas 
the right one is its representation in the alignment. There are two different 
states to represent an insertion on the second sequence since they have also 
to represent the fate (conservation or deletion) of the root nucleotide in the 
first sequence. That is because transitions from an insertion on the second 
sequence to an insertion on the first sequence depend on the fate of the last root 
nucleotide on the first sequence. Insertions on the second sequence are written 
in the alignment before insertions on the first sequence. The probabilities on 
the edges are those of (Qy and a(t) = e~^^ . 
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Bare alignment Homology structure 
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Figure 3: The bare alignment and the associated homology structure for the 
multiple alignment of Figure Ui There are two columns for each position in 
the ancestral sequence: in the first one, 1 stands for a nucleotide that has been 
conserved and a for a nucleotide that has been deleted; the second column 
represents the number of insertions in each sequence to the right of the ances- 
tral nucleotide. In the homology structure there is no artificial order between 
insertions in different sequences. 

{^n}n>i is a sequence of i.i.d. random variables on 

= {(e(l), e(2)) = a^-'') \ 6' G {0, 1}, a' > 0, i = 1, . . . , k} . 

The first column of corresponds to the homologous positions to the n- 
th ancestral character. If it is conserved in sequence i, i = l,...,k, then 
e\{l) = 1, else = 0. It is possible for an ancestral character to have been 
deleted in all the observed sequences (en(l) = 0^, where 0^ stands for the k- 
dimensional vector with all components equal to 0). The second column of £„ 
represents the number of insertions on the observed sequences between the n- 
th and the (n-|- l)-th ancestral sequence characters. It is possible to have none 
insertions in any of the observed sequences between two homologous positions 
(£:„(2) = Ofc). See Figure [3] for an example of an homology structure. 

Due to the branch independence, the law of n > 1, under the TKF91 
indel model, is given by 

i=l 

Conditionally to the result of the indel process (the bare alignment), nu- 
cleotides on the observed sequences are emitted according to some substitution 
process. In practice, most nucleotide substitution processes are described by 
a continuous time Markov chain defined on A and depending on the branches 
lengths (see Felsenstein, 2004, for instance). Let us note u the stationary law of 
this process and Pt{-,-) the transition probability matrix for a transition time 
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t > 0. Then, for n > 1, if £„, = {S^'''', a^-'^), r = Yli=i ^* nucleotides are emitted 
in the conserved positions according to the joint probability distribution hj, 
J = {i\Si = 1}, on with 

r 

h{.,_,^}{x'\ ...,x'^) = Y,u{R) Hp,^^ {R, x'^), (4) 

ReA j=l 

where R represents the unknown ancestral nucleotide. Note that hj does not 
only depend on the cardinal of J, but also on its elements via the branches 
lengths {tj}i=i,...,fc. In the inserted positions, XliLi nucleotides are emitted 
independently and identically distributed according to the probability distri- 
bution /(■) = z/(-). 

In classical substitution processes there is independence between the di- 
fferent sites of the ancestral sequence. That means that conditionally on 
{en}n>i, the emissions of nucleotides on the observed sequences at different 
instants (positions of the ancestral sequence) are independent and equally dis- 
tributed as described below. 

3 The multiple-hidden i.i.d. model on a star 
tree 

We present in this section the multiple- hidden i.i.d. model, where multiple 
refers to the number (> 2) of observed sequences and i.i.d. to the nature 
of the hidden process, by analogy to the name of the pair-hidden Markov 
model. The homology structure of k sequences evolving under the TKF91 
indel evolution model and a particular substitution model, as described in the 
precedent section, is a particular parametrization of this model. 

Consider a sequence of i.i.d. random variables {£„}„>! on the state space 

£^ = {(e(l), e(2)) = a^'^) | 5* G {0, 1}, a' G N z = 1, . . . , k} 

with distribution tt. 

The process {en}n>i generates a random walk {Zn}n>o with values on 
N^- by letting Zq = Ok and Z„ = Ei<,<n[^i(l) + ^j(2)] for n > 1. The 
coordinate random variables corresponding to Zn at position n are denoted 
by {Zl, . . . , Z^) {i.e. Zn = {Z^, . . . , Z!^)). In the homology structure context 
they represent the length of each observed sequence up to position n on the 
ancestral sequence. 

Let us now describe the emission of the observed sequences which take 
values on a finite alphabet A. We distinguish to kinds of emissions, joint 
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emissions across k or a. smaller number of sequences (corresponding to 
and single emissions (corresponding to £:„(2)). For n > 1, if e„ = {S^-'',a^-^) 
then a vector of r = Yl^=i ^-^^ emitted according to some probability 
distribution hj, J = {i\6^ = 1}, on and Yli=i'^^ {"^la" — 
i = l,...,k, are emitted according to the following scheme: {^jYjZ^i'li are 
independent and identically distributed from some probability distribution / 
on A. 

Conditionally to the process {en}n>i, the random variables emitted at di- 
fferent instants are independent. The whole multiple-hidden i.i.d. model is 
described by the parameter 9 = (tt, {hj}jcK, f) £ ©, where K = {1, . . . , k}. 
We do not consider the branches lengths as a component of the parameter and 
assume they are known. 

The conditional distribution of the observations given an homology struc- 
ture ei;n = {ej)i<j<n = {{5f-'',a]-''))i<j<n, writes 

^eO^lk-.zJ^l-.n = ei:n, {£m}m>n, {^ni}i&K,n^>Z^) = ^^(Xlj.:^^ I^lin = ^l-.n) 

n 

n k '^j 

j = l 1 = 1 s = l 

where 1^ stands for the /c-dimensional vector with all components equal to 
1 and Xi = (X^.^i , . . . , Xf.^fe). This notation can be confusing since it 

is possible to have + > for some i E K and for some j > 1. 
However when writing ^Zj^l+l^:■.ZJ we will only be considering the variables 
corresponding to those sequences i E K for which + Ik Zj. 
The complete distribution Fg is given by 

n n 

= Fg{X^^.,zJe,.,n = ei,n)l[Fg{e, = e,) = Fg(X^^.,zJe,.,n = ei.^ H^^^i) 

i=i i=i 

We denote by Fg (and Kg) the induced probability distribution (and correspon- 
ding expectation) on x (^^)'^ and 9o = (ttq, {hoj}jcK, /o) the true param- 
eter corresponding to the distribution of the observations (we shall abbreviate 
to Fq and Eq the probability distribution and expectation under parameter 
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3.1 Observations and likelihoods 

As in the pair-HMM (see Arribas-Gil et ai, 2006) there are different interpre- 
tations of what the observations represent on this model, and thus different 
definitions for the log-likehhood of the observed sequences . . . ,X^.^J. 

However, the difference with the pair-HMM is that in the muhiple-hidden 
i.i.d. model we suppose that the observed sequences are cut out of very much 
longer sequences between known homologous positions. This implies that any 
interpretation of what observations represent must assume that the underlying 
process {£:n}n>i passes through the points 0^ and (rii, . . . ,nk). 

One may consider that what we observe are sequences that have evolved 
from an ancestral sequence of length n so that the likelihood should be Fe{^if.:z„) 
= Fg{Xi^-_Zn, Zn)- This term is computed by summing, over all possible ho- 
mology structures from an ancestral sequence of length n, the probability of 
observing the sequences and a homology structure. 

Let us define £ni,...,nk the set of all possible homology structures of k se- 
quences of lengths rii, . . . , rifc: 

n 

^ni,...,n, = {e e neN, J]] |e,| = (m, . . . , n^)}. (6) 

i=i 

For any homology structure e G £ni,...,nky if e G {£'')"', then n is the length of 
the path e and is denoted by |e|. In the homology structure context, |e| stands 
for the length of the ancestral sequence. So we have 

ee£^z„;|e|=n 

Then, we would define the log-likelihood in{0) as 

C(^)=logP,(Xi,:zJ, n>l. (7) 

But since the underlying process {Zn}n>o is not observed, the quantity £ri(6') 
is not a measurable function of the observations. More precisely, the length n 
at which the observation is made is not observed itself. Though, if one decides 
that . . . , -^hnj.) corresponds to the observation of the emitted sequences 

at a point of the hidden process Z„ = (^^)i=i,...,fc and some unknown "ancestral 
length" n, one does not use in{0) as a log-likelihood, but rather 

Wn{e)=\ogQe{X,^:zJ, n>l (8) 

where for any integers rij , z = 1 , . . . , 

Q,(XL„ . . ..X^J = P,(3m > 1, = (m, . . . ,n,);XL„ • • • (9) 



13 



In other words, Qg is the probabihty of the observed sequences under the 
assumption that the underlying process {£ri}n>i passes through the point 
(wi, . . . , rifc). But the length of the ancestral sequence remains unknown when 
computing Qg. This gives the formula: 

Q,(XL„...,XLJ= P.(^i:H = e,XL„...,XL,). (10) 

Let us stress that we have 

Wn{e) = logPe(3m >l,Zm = (Z;),=i,...,fc; Xj^^i, . . . , X^^,), n > 1, 

meaning that the length of the ancestral sequence is not necessarily n, but is 
in fact unknown. 

In the homology structure context, Qg is the quantity that is computed by 
the multiple alignment algorithms (see for instance Holmes and Bruno, 2001, 
Steel and Hein, 2001, or Lunter et ai, 2003) and which is used as likelihood in 
biological applications. The more extended application is to use this quantity 
to co-estimate alignments and phylogenetic trees in a Bayesian framework via 
MCMC calculations (cf. Fleissner et ai, 2005; Lunter et ai, 2005; Novak et 
al, 2008). Indeed, algorithms that perform this joint estimation compute, at 
each iteration, the likelihood of sequences for a given phylogenetic tree. Thus, 
asymptotic properties of the criterion Qg and consequences on asymptotic 
properties of the estimators derived from Qg are of primarily interest. 

We will look for asymptotic results for n — * oo. We need to establish some 
kind of relationship between n and ni, . . . , Uk, to derive asymptotic results for 
Ui — > oo. From our definition of the multiple-hidden i.i.d. model, it is clear 
that it does not exist a deterministic relationship between the length of the 
hidden sequence and the lengths of the observed sequences. However, in the 
multiple alignment problem, a natural assumption is that very big insertions 
and deletions occur rarely and thus the length of the root sequence should 
be equivalent to the lengths of the observed sequences. In fact we have the 
following result. 

Lemma 1 In the multiple- hidden i.i.d. model on a star tree under the TKF91 
indel evolution process, that is, when tt is the distribution given by for any 
X > we have Zl^ n, i = 1, . . . , k, Fx-almost surely. 

Proof. For all i = 1, . . . ,k and for all n > 1 we have that 

n 
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where {£j}j>i are i.i.d. Moreover, from ([2]) we have, for any A > 



E,\e]{l) + e){2)] 

= ^m{PA(£}(l) +4(2) = m,4(l) = 0)+Pa(4(1) +4(2) = m,4(l) = 1)} 

m>l 

■m>l 

" S"" {(l + ~ ') 1 + Ati (l + At J \ + XU (l + At J } 

= 1. 

Now the result holds from the strong law of large numbers. □ 

According to this lemma, under the TKF91 indel evolution model, asymp- 
totic results for n ^ oo will imply equivalent ones for rii — oo, i = 1, . . . , k. 
Let us establish an assumption to get the same result for the general multiple- 
hidden i.i.d. model. 

Assumption 1 In the multiple-hidden i.i.d. model on a star tree Eg [£„(!) + 
en{2)] = Ik, for n>l, for any 9 eQ. 

3.2 The case of two sequences 

Let us consider the case in which k = 2. It is clear that the general multiple- 
hidden i.i.d. model and the pair-HMM are different in this case. However, 
in the context of the alignment of two sequences evolving under the TKF91 
model, the two models are equivalent. In fact, in the pairwise alignment we 
consider that one of the sequences is the ancestor of the other one, but since 
the TKF91 model is time reversible, this is equivalent to consider that both 
sequences evolve from a common unknown ancestor. 

First of all, let us remark that the likelihood (Qe) of two sequences Xi-n and 
yi._m is the same under the two models. Let t be the evolution time between 
both sequences, that is, the sum of the evolution times between the root and 
each one of the sequences, ti + 12, in the multiple alignment setup. Consider 
for the pair-HMM the following transition matrix: 
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D 



H 



V 





D 


H 


V 




f 


a{t) 


1 - a{t) 


Xt 


\ 




(1 + At) 


(1 + Xt) 


(1 + At) 






(1 - K{t))a{t) 


{I - K{t)){l - a{t)) 


nit) 






a{t) 


1 - a{t) 


Xt 






(1 + At) 


(1 + At) 


(1 + At) 


/ 



where D, H and V stand for di agonal, horizontal and v ertical movements 
respectively, with the notations of lArribas-Gil et al\ (120061 ) . and a(t) = e~^*, 
K(t) = 1 — (i+Af)('i-a(t)) • ^^^y show that the probability of an ho- 

mology structure (under the multiple-hidden i.i.d. model) is just the sum of 
the probabilities of all possible alignments (under the pair-HMM) leading to 
that homology structure. Then, the sum over all possible alignments and all 
possible homology structures of two sequences is equivalent. 

Finally, note that for the transition matrix in fill I) the stationary probabi- 
lit ies of insertions and dele tions are the same, that is p = g with the notations 
of lArribas-Gil et al\ (120061 ) . That means that we are in the case where the 
main direction of the alignment, that is, its expectation under the pair-HMM, 
is always the straight line from (0, 0) to (n, n) for every value of the parameter. 
This is also the case in the multiple-hidden i.i.d. model as we have shown in 
Lemma [TJ 



4 Information divergence rates in the star tree 
model 

4.1 Definition of Information divergence rates 

In this section we prove the convergence of the normalized log-likelihoods in{0) 
and ujn{(^)- Let us note 

Bo = e e I 7r(e) > 0, hj{x'--\'\) > 0, f{y) > 0, 

Ve e ^^ Vx^^l^l e ^1^1, VJ CK, Vy e A} . 

We shall always assume that Oq G Qq- 
Theorem 1 The following holds for any 6* G Oq-' 

i) n~^ln{d) converges FQ-almost surely and in Li, as n tends to infinity to 

i{9) = lim -Eo (logPe(Xi,;zJ) = sup -Eo (logFeiX^.-.zJ) ■ 
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a) n ^Wn{0) converges FQ-almost surely and in hi, as n tends to infinity to 

w{9) = lim -Eo (logQ,(Xi,:zJ) = sup -Eq (log QeiX^^-.z J) • 
n^oo n n n 



Using the terminology of lArribas-Gil et al\ (120061 ) we then define Information 
divergence rates: 

Definition 1 G 0o, D{e%) = w{eo) - w{e) and D*{e\eo) = i{9o) - 



We recall that D* is what is usually called the Information divergence rate 
in Information Theory: it is the limit of the normalized Kullback-Leibler di- 
vergence between the distributions of the observations at the true parameter 
value and another parameter value. However, we also call D an Information 
divergence rate since Qg may be interpreted as a likelihood. 

Proof of Theorem [1] , This proof is similar to the proof of Theorem 1 in 
Arribas-Gil et al\ ( 120061 ). We shall use the following version of the sub-additive 
ergodic Theorem due to iKingmanI (119681 ) to prove point i). A similar proof 
may be written for ii) and is left to the reader. 
Let [W,, ,t)o<s<t be a sequence of random variables such that 

1. For all m<n, Wo,n > W^O,m + Wm,n, 

2. For all / > 0, the joint distributions of {Wm+i,n+i)Q<m<n are the same as 

those of (Wm,n)Q<m<n, 

3. Eo{Wo,i) > -oo. 

Then lim„n"^lVo,n exists almost surely. If moreover the sequences (Wm+/,n+/)/>o 
are ergodic, then the limit is almost surely deterministic and equals sup„n"^Eo(Wo,n) 
If moreover Eo(VFo,n) < An, for some constant A > and all n, then the con- 
vergence holds in Li. 

We apply this theorem to the process 



Wm,n = logPe(Xz^+l,:zJ, < m < U. 
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Note that since Zq = 0^ is deterministic, we have Wo^n = ^ogFg{Xi^,z^] 
Super-additivity (namely point 1.) follows since for any < m < n, 



e&£z„ 
\e\=n 



e&Szm e'G£'z„_z„ 
\e\=m \e'\=n 



e&£Zra (i'&£Zn-Zm 

\e\=m \e'\=n-m 



so that we get Wo,n > W^o,m + Wm,n, for any < m < n. 

To understand the distribution of (Wm,n)o<m<n, note that Wm.^n only de- 
pends on trajectories of the random walk going from the point (Z^, . . . , Z^) 
to the point (Z^, . . . , Z^) with length n — m. Since the variables (£n)n>i are 
i.i.d., one gets that the distribution of {Wm,n) is the same as that of {Wm+i,n+i) 
for any /, so that point 2. holds. 

Point 3. comes from: 

Pe(Xi,;zJ = Pe(£i = e)P,(Xi,,zj£i = e) 

eGfzi 
|e|=l 

|e|=l 

Pq- almost surely, since 6* G Go, provided that Z\ > 1 for some i & K. So 
Eo(W^o,i) =EologP^,(Xi,;zJ > -00. 

Let us fix < m < n. The p roof that W^'* = (Wm+i,n+i)i>o is ergodic is 
the same as that of iLerouxl (119921 ) (Lemma 1). Let T be the shift operator, so 
that if M = {ui)i>o, the sequence Tu is defined by (Tu)i = (m)«+i for any / > 0. 
Let B be an event which is T-invariant. We need to prove that Po(py"'"' G B) 
equals or 1. For any integer i, there exists a cylinder set Bi, depending 
only on the coordinates ui with —ji < I < ji for some sub-sequence ji, such 
that Po(W^''"'" G BABj.) < 1/2*. Here, A denotes the symmetric difference 
between sets. Since W"^'"^ is stationary and B is T-invariant: 

Po (Py""'" G BABj^) = Po (T^^W™'" G BABj^) = Pq (M/""'" G BAT-'^^^Bj^) . 
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Let B = ni>i Uh>^ T-^^'^Bj^. Borel-Cantelli's Lemma leads to Po(iy'"'" G 
BAB) = 0, so that Fo{W"'''' e B) = Po(M^'"'" & B) = PqIW^"'" E B n B). 
Now, conditional on (e„)„,eN5 the random variables (Wm+i,n+i)i>o are strongly 
mixing. Indeed Wm.+i,n+i only depends on a finite number of other (Wm+k,n+k) , 
k > 0, namely (Wm+k,n+k)k=max[i,'m+i-n+i),...,n+i-m- Then the - 1 law for 
strongly mixing processes (see Sucheston, 1963) implies that for any fixed 
sequence e with values in {S'')^, the probability Po(iy™'"' G B\{en)n = e) 
equals or 1, so that 

Po (W^™'" G 5) =Po GC) 

where C is the set of sequences e such that Po(iy™''" G -B|(£:„)n = e) = 1. But 
it is easy to see that C is T-invariant. Indeed, if e G C then, since ly™'" is 
stationary and B invariant, 

1 = Po(H^-'" G B\{en)n = e)= Po(TPy"^'" G ^Ke^n = Te) 

= Po(l^^'"'"G5|(£„)„ = Te) 

so that Te G C. Now, since (en)n>i is an i.i.d. process, it is ergodic so 
Po {{£n)n G C) equals or 1. This concludes the proof of ergodicity of the 
sequence W"^'"". 

To end with, note that for any n > 0, the random variable VFo,n is non 
positive, ensuring the convergence of {n~^Wo,n} in Li. □ 

4.2 Divergence properties of Information divergence rates 

Information divergence rates should be non negative: this is proved below. 
They also should be positive for parameters that are different than the true 
one: we only prove it in a particular subset of the parameter set. Let us define 
the set 

Qmarg = {O E Bo : k'j = f,\/JC K, G j} . 

where /ij denotes the i-th marginal of hj. 
Theorem 2 Information divergence rates satisfy: 

• For all e G Oq, D{e\eo) > and D*{d%) > 0. 

• // 6q and 9 are in Qmarg, D{9\9q) > and D*{6\6q) > as soon as 
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Note that from Assumption 1 the expectation of £„(!) +£:„(2), n > 1, is the 
same for any value of the parameter. Thus, we can not estabhsh the positivity 
of the information divergence rates for values of 9 for which the expectation 
of the hidden process is different than under 6*0, as it is done for pair-HMMs 
(Theorem 2 of Arribas-Gil et ai, 2006). 

Also note that when we consider classical markovian substitution processes 
for the emission laws, as described in (jl]), the parameter always lies in Qmarg, 
since the marginal emission distributions are equal to the stationary distribu- 
tion of the Markov process. 



Proof. Since for all n, 

Eo (logPo(Xi,.^J) - Eo (logPe(Xi,.^J) 

is a Kullback-Leibler divergence, it is non negative, and the limit D*{6\6o) is 
also non negative. 

Let us prove that D{9\9o) is also non negative. To compute the value of 
the expectation Eo[iyn(^)], note that the set of all possible values of Zn is N'^. 
Then, 



EoK(^)] 

= E E M^n = K, . . . = . . . = xij 



Now, by definition. 



Die\e,)= hm -Eoflog^^i^ 



By using Jensen's inequality. 



Eo flog #^^^^1 < logEo ( ^^(^1-^" 



Qeo0^1k:Z„. 



Ik-Zn 



logEo 



Er 



Qei^i,:zJ 

Qooi^lk-.zJ 
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Now, for all (rii, . . . , rik) G 



Er 



(Qo{ 



{ni, ...,nk) 



(4:n,)Ll 



< P.(3m>l,Z„ = (ni,...,nfc),XL,=xU,...,XL, = xLJ 



Pe(3m > = (ni,...,nfc)) < 1 



where (a) comes from expression ffTOl) . Thus, Eq 
and 



Eo 



< 1, 



lim — (w„(6') — Wri(6'o)) < liminf — logEo 



Eo 



XQeoi^ik-.Zn] 



< 0. 



So finally 

e Go, /^(^l^o) > 0. 

Let us now consider the case where 6o and 6 are in Qmarg- Let us remark 
that for any 6 G &marg we have 



¥e{Zn = (ni, . . . ,nfc), = x^^^J 

= Y ^e{Zn = {ni,...,nk),Xl 



ni "^lini ; • • • ; 



X" \^ = T' 

ni •^linii ■ ■ ■ ■> ^^l-.rik l:nfc' 



1 ,„fc 



|e|=n 



^ Y Mei:n = e)Fg{Xl 

'"•1 !■ ■■ ; 

|e|=n 



rii "^lini ; • • • ; Im^ 



eef„,,...,„, (4^ )fc 



l:ni / 



(12) 



where the last equality comes from ([5]) . In the same way, for any 9 G Qmarg we 



have that Fg(3m < 1, Zm = {ni, . . . , nk),Xl. 



rii "^l-.rii, 



¥g{3m < 1,Z„ 
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(ni, . . . , '"'fc))/®"'HxJ.„J. This is also true for any other sequence i 
1 . . . ,k. Then, using Jensen's inequahty and definition ( fTOl) . 

Eo ( log 



Xlog Z^Cr^ — ^ 7 V 

y [^l■.nJ^=2 Po(z„ = (ni...,nfc),Xl„^=4„jQeo(4„iv..,4n,) / 

(ni...,nfe)eN'= x\.^_^ 



xlog 



Pe(3m > 1,Z„, = (ni,...,nfc))r-i(xLj 



Po(Z„ = (ni,...,nfc))/o^"^(a:LJ 
where the last inequality comes from fll2p and the fact that 
Po(Z„ = (m, . . . ,nfc), = . . .,X^^^ = xi^J < Qeoixi^,, ■ ■ • ,4nJ- 

Thus, we have 



-Z^(^l^^o) <limsup- Yl Po(^n = (ni,...,nfc)) 

n— »+oo TT- , , , 

(ni...,nfe)GN''- 

I Fo{Zn = {ni,. . . ,nk)) ^ /o(x) J 

<limsup-|log ^ P0(3m > 1, = (ni, . . . ,nfc)) 

+ Po(^n = K,...,nfc))niJ]/o(x)log^} 

< hmsup -{Eo[Z^]^/o(a;) log /M| ^Umsup ijn^ /o(a;) log <0, 

n— >+oo 



as soon as / 7^ /o, since Eo[Z^] = n from Assumption 1. 
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Figure 4: (a) An arbitrary phylogenetic tree, (b) A binary phylogenetic tree. 
TZ stands for the ancestral sequence, iV' stand for sequences in inner nodes 
(non-observed sequences), and stand for observed sequences. 



The proof for D* follow the same lines. □ 

It would be interesting to prove the uniqueness of the maximum of the 
functions l{9) and w{9) at the true value of the parameter 9q. If that was 
true, the consistency of maximum likelihood and bayesian estimators would 
be obtained with classical arguments (see Arribas-Gil et ai, 2006). In Section 
[6] we investigate the behavior of functions i{6) and w{6) via some simulations. 



5 Extension to the case of an arbitrary tree 

Let us now consider an arbitrary phylogenetic tree, that is, a tree with inner 
nodes such as the one in Figure|l](a). Without loss of generality we can assume 
that we deal with a binary tree (the number of edges going out from every 
inner node is equal to two) in which the length of the path from the root to 
each leaf is the same for every leaf in the tree. There is an example of this kind 
of tree in Figure H] (b). Indeed, we will only use this fact to simplify notations, 
since it allow us to describe the evolutionary behavior of any internal node in 
a general way and define the model in a simpler manner. Otherwise, the state 
space of the hidden process would depend on the particular structure of the 
tree, but the results given in this section still hold. 

The multiple-hidden i.i.d. model on a binary tree with k observed se- 
quences, 72,fc, is defined as follows. Consider a sequence of i.i.d. random 
variables {en}n>i on the state space 

8^2,, = |e e 7W(2._i),2™ , m G N; ej G {(1, 0), (0, 0)}, 

^{a,,^,^'} ^ {(1,0) X^^ Qg^a} , P = 1, • • • , m, V/i G /, Vi, i' G O, i ~ i' } 
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where Aia,b denotes the set of all a-hy-b natural matrices, 03^2 denotes the 
3-by-2 null matrix, / denotes the set of internal nodes (unobserved sequences) 
of the tree and O denotes the set of external nodes (observed sequences) of 
the tree. For an observed sequence i, ai stands for its direct ancestor, that 
is, the sequence that is placed in its closest internal node. For two observed 
sequences i and i', we write z ~ z' if they share the same direct ancestor (that 
is, Oj = Oj'). For e in S'^^''', e|*'"'''^^ is the sub- matrix of e composed by rows 
h and columns 2(p — 1) + 1 and 2p. 

An element e of f^^^.fe represents the fate of a nucleotide in the root sequence 
and all the insertions produced at the different levels of the tree. It is a 
finite sequence of (2^^ — l)-by-2 matrices, in which each row represents one 
node (sequence) of the tree. We will assume that the first row represents 
the sequence at the root. The first (2^^ — l)-by-2 matrix represent the fate of 
the nucleotide at the root in the first column (whether it is conserved, 1, or 
deleted, 0, in each one of the sequences) and the number of insertions produced 
to its right in the observed sequences (second column). The difference with 
the star-tree case, is that now we may also have (non-observed) insertions in 
the internal sequences. They appear in the following {2^ — l)-by-2 matrices, 
represented by 1 in the corresponding position of the first column, where we 
also represent the fate of the inserted nucleotide and the number of insertions 
produced to its right in the corresponding descendant sequences. The rows of e 
that are not concerned by that insertion (because the corresponding sequences 
are not descendant of that internal sequence) may represent the fate of another 
inserted nucleotide in a different internal sequence. That is why in the same 
(2^ — l)-by-2 matrix we may represent independent events in different rows. 
Indeed, the events represented in two different rows i and j of the same (2'^ — 1)- 
by-2 matrix are independent if the row corresponding to the closest common 
ancestor of i and j in that matrix takes the value (0, 0), and they are dependent 
if it takes the value (1,0). There is an example of an element of i^^^.fe 
Figure O 

For e G E^'^''^ such that e E A^(2fe-i),2m.) m eN, we will note |e| = m. Also, 
for any (2*''— l)-by-2 submatrix Cp, 1 <p < |e|, we will note ||ep|| = ep(l)-|-ep(2), 
that is, the sum of the two columns of Cp. e"^^ will denote the /i;-by-2|e| matrix 
whose rows are the rows on e corresponding to the observed sequences. For any 
internal node (non-observed sequence) i E I, di will denote the set of the two 
direct descendants of i, and Di will denote the set of all the descendants of i 
which are observed sequences. Also, for any sequence i, and any 1 < p < \e\, 
such that ep'(l) = 1, we will denote ||ep|| = X]-r=p ll^r-ll) where q is such that 
e"*(l) = for r = p -I- 1, . . . , g — 1 and e^*(l) = 1. ||e* || represents the total 
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Homology structure 
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Figure 5: An element e of S^^'*^ (corresponding to the phylogenetic tree in 
Figure|l](b)) and a possible representation of the associated multiple alignment 
(the choice of the order in which insertions appear is arbitrary). Vertical lines 
separate the different 15-by-2 submatrices. In this example, |e| =8. 

number of descendants in sequence i of the given nucleotide from sequence Oj. 
If i is one of the two direct descendants of the root then ||e^|| = Yl^r=i ll^rll- 
Note that if i stands for an observed sequence (external node), for any p such 
that ep'(l) = 1, llcpll = ||ep||. The same notations apply to the random process 

In the case in which we consider the TKF91 indel model, due to the branch 
independence, the law of is given by 

|e| 2*^-1 / \]l{ej,(l)=l} / N]l{e;(l)=0} 

n{en = e) = ll n U^iU)) ,eeS^^-^,n>l 

<=p'(l)=l 

where ti represents the evolutionary time between sequences i and Cj. In the 
general case we will note vr the law of £„. 

As in the star tree case, the process {en}n>i generates a random walk 
{Zn}n>o with values on N'' by letting Zq = 0^ and Z„ = Ei<j<n T.i<p<\sj\ 11^? H 
for n > 1. The coordinate random variables corresponding to Z„ at position 
n are denoted by (Z^ . . . , Z^) (z.e. Z„ = (Z^, . . . , Z^)). 

Let us now describe the emission of the observed sequences which take 
values on a finite alphabet A. We distinguish to kinds of emissions, joint 
emissions across /c or a smaller number of sequences (corresponding to £:„p(l), 
1 < P < kn|) and single emissions (corresponding to enp(2), 1 < p < \£n\)- 
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For n > 1, and for 1 < p < if e^pi^) = 1 ^np(l) = ^iny 

T ^ I, then a vector of r = |{i G Dt-|£^^p(1) = 1}| r.v. is emitted according 
to some probability distribution hj, J = {i E = 1}, on ^'^^ and 

X]ieDr^np(2) r.v. (2)}' ^ -^-T' emitted according to the following 

scheme: {^jY^^^ (^2) independent and identically distributed from some 
probability distribution / on A. 

Remark 1 In practice, the emission law h, may take into account the emi- 
ssions in internal sequences. Consider, for instance, the emission in the first 
column of the homology structure of Figure // we deal with a classical 
markovian substitution model, with stationary distribution v and transition 
probability matrix pt{- ,■) , the emission of nuleotides 
quences X^, . . . , X^, X'^, X® would have probability 



^{i,...,5,7,8}(2:\ . . . 

-RG^ I \riGA 



5GA 



.T4eA 

YPssi'r2,T5)pt^{T5,X^ 

Pse {t2, TQ)pt, (rg, x'^)pts (rg, x^ 

.7-6 G A 

where R represents the nucleotide in the root, Ti the nucleotide in internal 
sequence N^, Sj the evolution time to internal sequence from its direct 
ancestor and ti the evolution time to observed sequence X^ from its direct 
ancestor. 

As in the star tree case, conditionally to the process {en}n>i, the random 
variables emitted at different instants are independent. The whole multiple- 
hidden i.i.d. model is described by the parameter 9 = (tt, {hj}jcK, f) G 0. 
The conditional distribution of the observations given an homology struc- 



26 



ture ei;„ = (ej)i<j<„, writes 

n 

Pe(Xl,:Z„kl:n = ei:0 = nPe(Xz,-,+l,:Z,k, = 6,) 

j=lp=l re/; \ P / 

e;;(i)=o 

x{niI/(^l.._..E-:ii.yi«y..j}- (13) 

«SO s=l 

And tlie complete distribution Pe is given by 

At this point we can define the parameter set 0o, likelihoods Wn{G) and 
£„(6') and divergence rates D{6\6o) and D*{6\6o) in the same way as in the star- 
tree case. Indeed Theorem 1 also holds in this case. Moreover, since we do not 
exploit any specific characteristic of vr or the emission laws to prove this result, 
the proof is exactly the same as the one given in Section HI The only slightly 
difference appears when proving point 3, but it is clear that Pe(Xij.:2j > 
also holds in this case for ^ G Go- 

By analogy to the star tree case, we will establish an assumption to ensure 
that asymptotic results for n oo will imply equivalent ones for — > oo, i = 
1, . . . ,k. It also guarantees that IE0[Z„] = n, for n G N, as it is required to 
prove Theorem 2. 

En 

Assumption 2 In the multiple-hidden i.i.d. model on a binary tree Eg [ Iknt^ll] = 
Ifc, for n > 1, for any 6 ^ Q. 

This assumption holds for the multiple-hidden i.i.d. model under the TKF91 
indel evolution process as it is shown in the following lemma. 

Lemma 2 In the multiple-hidden i.i.d. model on a binary tree under the 
TKF91 indel evolution process, for any \ > Q we have ~ n, i = 1, . . . , /c, 
Fx-almost surely. 
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Proof. We have already proved this result in the case of a star phylogenetic 
tree (Lemma [T]), that is, when we have a tree without internal nodes. Now, 
the idea of the proof, is that, if at each level of the tree the expectation of 
the number of nucleotides descending (conserved plus inserted) from a single 
nucleotide in the parent sequence is 1, the expectation of the total number of 
nucleotides at each observed sequence descending from a single nucleotide in 
the root sequence will also be 1. Let us show it recursively. 

Let L be the total number of levels on the tree, that is the number of 
edges between the root and an observed sequence (in the case of a binary tree, 
L = ln2 k). For each observed sequence i G Obs, we will note d-, I = 1 . . . , L 
the l-th ancestor of i, beginning at the direct ancestor and ending at the root 
of the tree. For all i G Obs and for all n > 1 we have that 



l<j<n l<p<\ej 



where {£^j}j>i are i.i.d. Moreover, we have, for any A > 



E 

p=i 



I S.I 
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a 

where (a) comes from the fact that ||e*^ || 7^ only for those p such that e^-^ (1) 



1, and (b) comes from Lemma [T] Finally, for any i G Obs, X]g=i IIS, II 
just the number of descendants (conserved plus inserted nucleotides) of the 
nucleotide in the root in one of its direct children. The expectation of this 
quantity is again 1 by Lemma [TJ The result holds from the strong law of large 
numbers. □ 



Finally, to prove that Theorem 2 also holds in the case in which we deal 
with an arbitrary tree, we need to show that for any 9 G Qmarg (same definition 



28 



as in Section Hj) and for any observed sequence i 

P,(Z„ = (m, . . . ,n,), Xl^^ = x\.J = P,(Z„ = (ni, . . . 
But this can be easily shown from expression (fT3|) in the same way that in 

Then the asymptotic resuhs obtained in Section |4] are also valid when the 
phylogenetic tree has a general form. 

6 Simulations 

We have considered for the simulations a 3-star phylogenetic tree, the most 
simple non trivial example of multiple alignment. The branches lengths, or 
evolutionary distance from the ancestral sequence to the observed sequences, 
are set to 1 in all branches. Let us recall that this distance is not the real time 
of evolution between sequences but a measure given in terms of the number of 
expected evolutionary events per site. Indeed, under the TKF91 indel evolu- 
tion model At is the expected number of indels per site between two sequences 
at distance t. 

The distribution of the hidden process has been taken to be the distribu- 
tion of the homology structure under the TKF91 indel evolution model, that 
is, {£„}„>! are independent and identically distributed as in ([3]). However, we 
have used the equivalent multiple-HMM (see for instance Hein et ai, 2003, and 
Figure [2]) scheme to simulate the sequences. Indeed, in practice it is easier to 
simulate from a finite state Markov chain than from our i.i.d. variables on N^. 
The number of states for the Markov chain for three sequences is 15 (2^ — 1). 
The simulated sequences have been used to compute the quantities i{9) and 
w{6). The log-likelihood uJn{0) has been computed with the Forward algorithm 
for multiple-HMM (cf. Durbin et ai, 1998). Note that this algorithm com- 
putes the log-likelihood by summing over all possible alignments of the three 
sequences. However, since a homology structure is just a set of alignments, 
this is equivalent to sum over all possible homology structures, and the final 
result is exactly u}n{0)- The time complexity for a non-improved version of 
this algorithm is 0(15^?T.in2n3), where rii, 122 and are the lengths of the 
observed sequences. Computation of in{0) is done with a modified version of 
the Forward algorithm that takes into account the length of the ancestral se- 
quence. The time complexity grows now to 0(15 n 721^2^3). This is the reason 
for having limited the simulations to 3 sequences. 

The emission distributions chosen for the simulations, {/ij}jc{i.2,3} ^^'i^ f\ 
are defined by the substitution model described below. 
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6.1 The substitution model 



For the whole simulation procedure we consider the following pairwise marko- 
vian substitution model: 

Pt{x, y) ^ {(1 _ e-"*)z/(x) + e""*} otherwise, 

where a > is called the substitution rate, t is the evolutionary distance, and 
for every letter x, i^{x) equals the equilibrium probability of x. This model 
is known as the FelsensteinSl substitution model (Felsenstein, 1981). We will 
take /(■) = z/(-). We define the emission function h as 

hj{{xi)i^j) = ^ z/(i?) 

for all J C {1,2,3}. 

The equilibrium probability distribution z/(-) is assumed to be known and 
will not be part of the parameter. Then we have /(■) = /o(-)- We will set it 
to {|, |, |, |} for the whole simulation procedure. The unknown parameter is 
e = (A, a). 



6.2 Simulation results 

We have computed the functions £{6) and w{6) for two different values of ^o- 

• Ao = 0.02, ao = 0.1 and 

• Ao = 0.01, ao = 0.08. 

The substitution rate is much bigger than the insertion-deletion rate and both 
are quite small, as expected by biologists. 

The graphs of i{6) and w{6) for these parameterizations are shown in Figures 
[6] and [71 For the first parametrization we can see that w{6) seems to take 
its maximum at (Ao,tto) (Figure [6|, top left). For i{9) this is not so evident. 
Neither for any of the two functions for the second parametrization. However, 
when looking at the cuts of w{6) and i{6) for a = and A = Aq we appreciate 
that in both parameterizations both seem to take their maximums near Aq and 
ao respectively. We remark that in the two examples, the functions i{6) and 
w{9) are very close to each other. 
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Figure 6: On top: w and i for parametrization (Xq = 0.02, ao 
bottom: cuts of i and w for a = ao fixed and for A = Aq fixed. 
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7 Discussion 



The main contribution of this work is to provide a probabihstic and statistical 
background to parameter estimation in the multiple alignment of sequences 
based on a rigorous model of evolution. We describe the homology struc- 
ture of k sequences related by a star-shaped phylogenetic tree as a sequence 
of i.i.d. random variables whose distribution is determined by the evolution 
process. Given the observed sequences, the homology structure is a latent 
(non-observable) process. We formally define the latent variable model that 
emits the observed sequences, namely the multiple-hidden i.i.d. model. We 
discuss possible definitions of likelihoods in comparison with the quantities 
computed by multiple alignment algorithms. Our main results are given in 
Theorems [T] and [2], where we first prove the convergence of normalized log- 
likelihoods and identify cases where a divergence property holds. We then 
extend the definition of the model and the results obtained to the case of an 
arbitrary phylogenetic tree. 

Despite the positive results that we obtain, it is not yet possible to vali- 
date the estimation of evolution parameters under the multiple-hidden i.i.d. 
model in every situation. However, the simulation studies that we present to 
investigate situations that are not covered by Theorem [2] provide encouraging 
results. 
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